Introduction

In this workshop we will be analysing pre-processed RNA sequencing data: specifically, reads from the raw tumour RNA-seq that do NOT map to the human reference genome (GRCh38). We will refer to these as unmapped reads.

By characterising the origin of these non-human unmapped reads we can identify potential bacterial or viral species which may be present in the tumour microenvironment of this patient.

Below is an overview of the pipeline:




Tools

All the tools required to run this pipeline are listed below:




Workshop setup

First we need to set up the environment to analyse the data in the workshop. This analysis will be performed in an interactive session, which is set up as follows:

# Setting master paths for the whole workshop

#Don't forget to replace ciw30 in the two paths below with your own ciw number
output_dir_master=/scratch/ciw/ciw30/metagenomics/Output
raw_rna_seq_data=/scratch/ciw/ciw30/metagenomics/Raw_input
#Do not change this one
resources=/scratch/ciw/ciw40/Tools




Extracting unmapped reads from a BAM file

In the Cancer_Microbiome_workshop folder you will find a BAM file labelled 'crc_patient.hisat.bam'. This is an RNA-seq sample which has been pre-aligned to the human reference genome (GRCh38).

To extract unmapped reads from the BAM file we will use a package called Samtools:

outputdir=$output_dir_master/unmapped
mkdir -p $outputdir

#Extract the Unmapped reads from the input Bam file
bsub -J "samtools" -e Error_log/samtools_err.log -o Log/samtools_out.log -n 10 "$resources/samtools/samtools view -b -f 4 $raw_rna_seq_data/crc_patient.hisat.bam > $outputdir/unmapped.bam"
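The -f 4 filter keeps reads whose SAM FLAG has the "segment unmapped" bit (0x4) set. For paired-end data you would typically use -f 12 instead, to keep only pairs where both mates are unmapped. A minimal sketch of the bit test behind the flag (the FLAG value 77 is just an illustrative example, not from the workshop data):

```shell
# FLAG 77 = 1 (paired) + 4 (read unmapped) + 8 (mate unmapped) + 64 (first in pair)
flag=77
# A read is unmapped when bit 0x4 is set; a non-zero result means unmapped
echo $(( flag & 4 ))
```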

We can use Samtools again to check how many reads remain in our RNA-seq sample after extracting the unmapped reads, and compare this to the initial number of reads we started with.

  • As you will see, only a small fraction of the total number of reads we started with remain
#Count the total number of reads in our initial BAM file
outputdir=$output_dir_master/read_counts
mkdir -p $outputdir
bsub -J "samtools" -e Error_log/samtools_raw_count_err.log -o Log/samtools_raw_count_out.log -n 10 "$resources/samtools/samtools view -c $raw_rna_seq_data/crc_patient.hisat.bam > $outputdir/Raw_read_count.txt"

#Count the number of unmapped reads we have extracted
bsub -J "samtools" -e Error_log/samtools_unmapped_read_count_error.err -o Log/samtools_unmapped_read_count_out.log -n 5 "$resources/samtools/samtools view -c $output_dir_master/unmapped/unmapped.bam > $output_dir_master/read_counts/Unmapped_read_count.txt"
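Once both count files have been written you can compute the unmapped fraction directly. A minimal sketch, using made-up example counts in place of the real values from Raw_read_count.txt and Unmapped_read_count.txt:

```shell
# Example counts; substitute the numbers from your two count files
raw_count=50000000
unmapped_count=1500000
# awk handles the floating-point division
awk -v raw="$raw_count" -v un="$unmapped_count" \
    'BEGIN { printf "%.2f%% of reads are unmapped\n", 100 * un / raw }'
```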




Convert BAM file to FASTQ Format

Before we can proceed to any downstream processing of our unmapped reads, we must convert our BAM file to FASTQ format.

In this example we will be using the Bedtools package. BioBamBam is also a commonly used package for this step, especially if you are working with a mixed library.

outputdir=$output_dir_master/bedtools
mkdir -p $outputdir
bsub -J "bedtools" -e Error_log/bedtools_error.err -o Log/bedtools_out.log -n 5 $resources/bedtools2/bin/bedtools bamtofastq -i $output_dir_master/unmapped/unmapped.bam -fq $outputdir/unmapped.fq
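A quick sanity check after the conversion: each FASTQ record is exactly four lines, so the read count should equal the line count divided by four. A minimal sketch using a toy two-record file (example.fq is made up for illustration, not part of the workshop data):

```shell
# Two toy FASTQ records: header, sequence, separator, quality string
cat > example.fq <<'EOF'
@read1
ACGT
+
IIII
@read2
TTTT
+
IIII
EOF
# Line count divided by 4 gives the number of reads
echo $(( $(wc -l < example.fq) / 4 ))
```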




De novo assembly with SPAdes

SPAdes analyses the reads and first assembles them into contigs and scaffolds

  • Contigs are consensus sequences built from sets of overlapping reads

  • Scaffolds consist of ordered and oriented contigs separated by gaps of estimated length

To run de novo assembly we will be using rnaspades.py, a Python script distributed with SPAdes

#Make an output directory called 'spades'
outputdir=$output_dir_master/spades
mkdir -p $outputdir


#Running rnaspades.py
# -s indicates that our input file consists of single-end RNA-seq reads.
#    For paired-end data, use -1 for the forward reads and -2 for the
#    reverse reads instead of -s
# -o indicates the output directory we want to store our results in

bsub -J "Spades" -e Error_log/spade_error.err -o Log/spade_out.log -n 10 python $resources/SPAdes/bin/rnaspades.py -s $output_dir_master/bedtools/unmapped.fq -o $outputdir --only-assembler

Now let's have a look at the output file produced by SPAdes:

# less allows you to read the contents of a file, to quit viewing this file just type the letter q and you will return to your terminal
less $output_dir_master/spades/transcripts.fasta

Your output should look something like this:

Each sequence in the FASTA file has a header, in this case: >NODE_1_length_3986_cov_17.47736. Each number in the header has a different meaning:

  • 1 is the number of the contig/scaffold

  • 3986 is the sequence length in nucleotides

  • 17.47736 is the k-mer coverage for the last (largest) k value used. Note that the k-mer coverage is always lower than the read (per-base) coverage.
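Since the header fields are underscore-separated, the individual values can be pulled out with awk. A minimal sketch using the example header from above:

```shell
# Fields after splitting on '_': >NODE, 1, length, 3986, cov, 17.47736
header='>NODE_1_length_3986_cov_17.47736'
echo "$header" | awk -F'_' '{ printf "node=%s length=%s coverage=%s\n", $2, $4, $6 }'
```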

Next, let's count the number of scaffolds SPAdes has identified from our sample:

# Count the number of scaffolds by counting FASTA headers ('>') with grep
grep -e '>' $output_dir_master/spades/transcripts.fasta | wc -l 




Contig/Scaffold Classification with Kraken2

To characterise the origin of our SPAdes output we will now run Kraken2.

Kraken2 uses k-mers to assign a taxonomic label, in the form of an NCBI Taxonomy ID, to each sequence. The label is assigned by comparing the k-mer content of the sequence in question to the k-mer content of the reference genome sequences. The result is a classification of the sequence to its most likely taxonomic label. If the k-mer content is not similar to any genomic sequence in the database used, no taxonomic label is assigned.
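The k-mer decomposition that Kraken2 relies on is simple to illustrate: every window of k consecutive bases in a sequence is one k-mer. A toy bash sketch (the sequence and k value are made up for illustration; Kraken2 itself uses much longer k-mers, 35 bp by default):

```shell
# Slide a window of length k across the sequence and print each k-mer
seq=ACGTAC
k=4
for (( i = 0; i <= ${#seq} - k; i++ )); do
    echo "${seq:i:k}"
done
```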

Kraken2 allows you to build several combinations of custom databases; however, in this example we will be using Minikraken2.

Minikraken2 is built by compressing the complete bacterial, archaeal, and viral RefSeq genomes from NCBI.

  • Minikraken2 is a compressed database and contains only 2.7% of the k-mers from the original Kraken2 database
outputdir=$output_dir_master/kraken
mkdir -p $outputdir
minikraken=/scratch/ciw/ciw40/Resources/minikraken2_v1_8GB/
bsub -J "kraken2" -e Error_log/kraken2_error.err -o Log/kraken2_out.log "$resources/kraken2/kraken2 --use-names --db $minikraken $output_dir_master/spades/transcripts.fasta --report $outputdir/unmapped_reads.report.txt  > $outputdir/unmapped_reads.kraken"

Now let's look at the output:

# less allows you to read the contents of a file, to quit viewing this file just type the letter q and you will return to your terminal
less $outputdir/unmapped_reads.report.txt

The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:

  • Percentage of reads covered by the clade rooted at this taxon
  • Number of reads covered by the clade rooted at this taxon
  • Number of reads assigned directly to this taxon
  • A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are shown as "-"
  • NCBI Taxonomy ID
  • The indented scientific name
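With that layout, standard tab-aware tools can filter the report. A minimal sketch that keeps only species-level lines (rank code S), using a made-up three-line report fragment rather than the real workshop output:

```shell
# Made-up fragment in kraken2 --report layout (tab-delimited)
cat > example.report.txt <<'EOF'
 12.50	125	125	U	0	unclassified
 80.00	800	10	D	2	  Bacteria
  5.00	50	50	S	853	    Faecalibacterium prausnitzii
EOF
# Field 4 is the rank code; keep species-level lines only
awk -F'\t' '$4 == "S"' example.report.txt
```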




Species and Genus Level Relative Abundance Estimations with Bracken

Bracken stands for Bayesian Re-estimation of Abundance with KrakEN, and is a statistical method that computes the abundance of species in DNA sequences from a metagenomics sample [Lu et al., 2017].

Bracken uses the taxonomy labels assigned by Kraken2 (see above) to estimate the number of reads originating from each species present in a sample.

Kraken2 classifies reads to the best matching location in the taxonomic tree, but does not itself estimate the abundances of species.

Combined with the Kraken2 classifier, Bracken produces more accurate species- and genus-level abundance estimates than Kraken2 alone.

#To run Bracken we need the Kraken2 report file and the database used for classification
outputdir=$output_dir_master/bracken
mkdir -p $outputdir
bsub -J "bracken" -e Error_log/bracken_error.err -o Log/bracken_out.log $resources/Bracken/bracken -d $minikraken -i $output_dir_master/kraken/unmapped_reads.report.txt -l S -o $outputdir/unmapped.bracken
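Bracken writes a tab-delimited table; per the Bracken documentation, the final column, fraction_total_reads, is the estimated relative abundance. A minimal sketch of sorting species by abundance, using a made-up two-row table in place of the real unmapped.bracken output:

```shell
# Made-up table in Bracken's output layout (tab-delimited, with header)
cat > example.bracken <<'EOF'
name	taxonomy_id	taxonomy_lvl	kraken_assigned_reads	added_reads	new_est_reads	fraction_total_reads
Fusobacterium nucleatum	851	S	40	5	45	0.45000
Bacteroides fragilis	817	S	30	2	32	0.32000
EOF
# Skip the header, sort by field 7 (fraction_total_reads) descending,
# and print the species names
tail -n +2 example.bracken | sort -t$'\t' -k7,7nr | cut -f1
```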




Data Visualisation with Krona

Krona tools is a package which allows us to create an interactive visualisation of all of the bacterial and viral taxa we have identified and quantified from our colorectal cancer sample.

First we need to prepare the input file for Krona. To do this we take column 1 and column 5 from our Kraken2 output file:

outputdir=$output_dir_master/krona
mkdir -p $outputdir

cat $output_dir_master/kraken/unmapped_reads.kraken | cut -f 1,5 > $outputdir/unmapped.kraken.krona

bsub -J "Krona" -e Error_log/Krona_error.err -o Log/Krona_out.log $resources/KronaTools-2.7.1/scripts/ImportTaxonomy.pl $outputdir/unmapped.kraken.krona -o $outputdir/unmapped-krona.html